Estimating phylogenetic distances between genomic sequences based on the length distribution of k-mismatch common substrings
Abstract
Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between two input sequences. Haubold et al. (2009) showed how the average number of substitutions between two DNA sequences can be estimated from the average length of exact common substrings. In this paper, we study the length distribution of k-mismatch common substrings between two sequences. We show that the number of substitutions per position that have occurred since two sequences diverged from their last common ancestor can be estimated from the position of a local maximum in the length distribution of their k-mismatch common substrings.
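To make the notion of a length distribution concrete, here is a minimal brute-force sketch in Python. It assumes a simplified definition: for each pair of start positions, the match is extended until the (k+1)-th mismatch or the end of either sequence. The function names `k_mismatch_length` and `length_distribution` are hypothetical, and the quadratic loop is for illustration only; it is not the authors' implementation and does not include their distance estimator.

```python
from collections import Counter


def k_mismatch_length(s1, s2, i, j, k):
    """Length of the common substring starting at s1[i] and s2[j],
    extended until the (k+1)-th mismatch or the end of a sequence.
    (Simplified definition, assumed for illustration.)"""
    mismatches = 0
    length = 0
    while i + length < len(s1) and j + length < len(s2):
        if s1[i + length] != s2[j + length]:
            mismatches += 1
            if mismatches > k:
                break  # the (k+1)-th mismatch ends the substring
        length += 1
    return length


def length_distribution(s1, s2, k):
    """Histogram mapping substring length -> number of position pairs (i, j).
    Brute force over all pairs, so only suitable for short sequences."""
    return Counter(
        k_mismatch_length(s1, s2, i, j, k)
        for i in range(len(s1))
        for j in range(len(s2))
    )


if __name__ == "__main__":
    a = "ACGTACGTTGCA"
    b = "ACGAACGTTGGA"
    dist = length_distribution(a, b, k=1)
    for length in sorted(dist):
        print(length, dist[length])
```

On real genomes one would look for a local maximum in such a histogram, as the abstract describes; how that maximum is converted into a substitution rate is left to the paper itself.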
Related articles
Alignment Free String Distances for Phylogeny
In this paper, we compare the accuracy of four string distances to recover correct phylogenies of complete genomes. These distances are based on common words shared by raw genomic sequences and do not require preliminary processing steps such as gene identification or sequence alignment. Moreover, they are computable in linear time. The first distance is based on Maximum Significant Matches. T...
kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison
MOTIVATION: Alignment-based methods for sequence analysis have various limitations if large datasets are to be analysed. Therefore, alignment-free approaches have become popular in recent years. One of the best known alignment-free methods is the average common substring approach that defines a distance measure on sequences based on the average length of longest common words between them. Herein...
kmacs: the k-Mismatch Average Common Substring Approach for Phylogeny Reconstruction
The vast majority of sequence comparison methods for phylogeny reconstruction rely on pairwise or multiple sequence alignments. These approaches are in practice not usable for longer sequences such as complete genomes. For this reason alignment-free methods have recently become more popular because they are much faster and usually computable in linear time. Some of these methods are based on re...
Genetic Diversity and Molecular Phylogeny of Iranian Sheep Based on Cytochrome b Gene Sequences
Phylogenetic relationships and genetic variation between two Iranian sheep breeds were analyzed using cytochrome b (cyt-b) gene sequences. Genomic DNA was isolated by the salting-out method, and the cytochrome b gene was amplified using the polymerase chain reaction restriction (PCR) method with a pair of primers. The partial cyt-b gene sequence of Iranian sheep is 780 bp long and contains 13 variable sites and...
Genome analysis with inter-nucleotide distances
MOTIVATION: DNA sequences can be represented by sequences of four symbols, but it is often useful to convert the symbols into real or complex numbers for further analysis. Several mapping schemes have been used in the past, but they seem unrelated to any intrinsic characteristic of DNA. The objective of this work was to find a mapping scheme directly related to DNA characteristics and that would...